Heatmap function
create_gene_heatmap <- function(de_results_df, dge_object, font_size = 12, num_genes = 20) {
# Subset the genes
top_genes <- de_results_df$Symbol[1:num_genes]
subset_matrix <- dge_object$logCPM[top_genes, colnames(dge_object$counts)]
# Create annotation dataframe
sample_info <- dge_object$samples
annotation_df <- data.frame(
Timepoint = sample_info$Timepoint,
Donor = sample_info$Donor,
Subset = sample_info$Subset
)
rownames(annotation_df) <- colnames(subset_matrix)
# Reorder gene expression matrix
ordered_samples <- order(sample_info$Timepoint_day, sample_info$Subset, sample_info$Donor)
subset_matrix <- subset_matrix[, ordered_samples]
sample_info <- sample_info[ordered_samples, ]
# Create the heatmap
heatmap <- pheatmap(subset_matrix,
scale = "row",
cluster_rows = FALSE,
cluster_cols = FALSE,
color = viridis(50),
show_colnames = FALSE,
annotation_col = annotation_df,
fontsize = font_size,
silent = FALSE)
}The aim of this notebook is to identify genes that change over time upon T cell stimulation.
I sequenced these samples with Prime-seq on extracted RNAs already. Results were not that great. Here reprocess the same samples with TIRE-seq for a head to head comparison.
The processing went as planned. Full writeup available at ELN link
This was generated in 2A_clustering
Workflow is from https://hackmd.io/@9RO7DTzsQ_2CaEAsT6y6NA/Bk0pW5gZn#Exercise-2---Cubic-Splines
This is a more relevant analysis strategy for timecourse experiments where we fit a polynominal curve to each timepoint.
sce <- readRDS(here::here(
"data/TIRE_Tcell/SCEs/tcell_cluster.sce.rds"
))
dge <- scran::convertTo(sce, type="edgeR")
tb <- as_tibble(dge$samples)Check on the perturbations conducted in this experiment:
Filter for genes that have at least 5 counts.
Currently keep 8,249
## Mode FALSE TRUE
## logical 12666 8249
Correct for composition biases by computing normalization factors with the trimmed mean of M-values method.
Check the MDS plot.
Day 0, 1 and 2 the times with the most difference.
Need to recode the timepoint to a numeric for the spline fitting to occur.
dge$samples$Timepoint_day <- as.numeric(str_split(
string = dge$samples$Timepoint, pattern = "_", simplify = TRUE)[,2]
)Use a cubic regression spline curve with 3 degrees of freedom.
According to the edgeR manual this is a good place to start.
Checked a few different degrees of freedom and 4 fits the data well
splines <- ns(dge$samples$Timepoint_day, df = 4)
design <- model.matrix(~splines + Donor + Subset, dge$samples)
#design <- model.matrix(~splines + Donor + Subset, dge$samples)
head(design)## (Intercept) splines1 splines2 splines3 splines4
## AAAGACTTCG 1 0.00000000 0.00000000 0.0000000 0.00000000
## AAAGTGTGTG 1 0.65304487 -0.01266043 0.2267403 -0.12754144
## AACGAGCCTG 1 0.71153846 0.15028422 0.1253577 -0.07051369
## AACGTCAGCC 1 0.00000000 0.00000000 0.0000000 0.00000000
## AAGATTTGCC 1 0.00976801 0.20863621 0.3397926 0.44180316
## ACAAAGAGAG 1 0.00000000 -0.16321244 0.3730570 0.79015544
## DonorDonor_64 SubsetCD8
## AAAGACTTCG 1 0
## AAAGTGTGTG 1 1
## AACGAGCCTG 0 1
## AACGTCAGCC 0 1
## AAGATTTGCC 0 1
## ACAAAGAGAG 0 0
The four spline coefficients do not have any particular meaning. Hypothesis testing would only make sense if the three coefficients are assessed together. The advantage of using a cubic spline curve is that it provides more stable fit at the end points compared to a polynomial.
Compared to the edgeR example on timecourse analysis the biological variation is pretty low.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04159 0.04381 0.04962 0.05177 0.05700 0.07432
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3460 0.7951 0.8640 0.8726 0.9391 1.4263
In a timecourse experiment, we are looking for genes that change
expression level over time. Here, the design matrix uses 3 natural
spline basis vectors to model smooth changes over time, without assuming
any particular pattern to the trend.
We test for a trend by conducting F-tests on 3 df for each gene:
The topTags function lists the top set of genes that vary the most over time.
results <- as_tibble(as.data.frame(
topTags(fit, n=length(fit$genes$Symbol))
))
results[results$FDR < 0.1,]The total number of genes with significant (5% FDR) changes at different doses can be examined with decideTests.
There are 8,675 differentially expressed genes across timepoints
## splines4-splines3-splines2-splines1
## NotSig 1579
## Sig 6670
Note that all three spline coefficients should be tested together in
this way. It is not meaningful to replace the F-tests with t-tests for
the individual coefficients, and similarly the logFC columns of the top
table do not have any interpretable meaning.
The trends should instead be interpreted by way of trend plots, as we
show now.
Finally, we visualize the fitted spline curves for the top four genes.
We start by computing the observed and fitted log-CPM values for each
gene:
top_genes <- head(results$Symbol,9)
logCPM.obs <- as.data.frame(cpm(dge, log=TRUE, prior.count=fit$prior.count))
logCPM.obs$ID <- row.names(logCPM.obs)
logCPM.fit <- as.data.frame(cpm(fit, log=TRUE))
logCPM.fit$ID <- row.names(logCPM.fit)
# Mung the cpm columns into a single tibble
lcm.obs <- pivot_longer(logCPM.obs[top_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
dplyr::rename(obs_cpm = count)
lcm.fit <- pivot_longer(logCPM.fit[top_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
dplyr::rename(fit_cpm = count)
lcm.obs$fit_cpm <- lcm.fit$fit_cpmAdd back the sample metadata
Set up the gene expression matrix to view a heatmap of early timepoints
Generate the plot linear x axis. Both genes go up and down in this plot
plt3 <- ggplot(lcm) +
geom_point(aes(y = obs_cpm, x=Timepoint_day, colour=Subset)) +
scale_colour_Publication() +
geom_smooth(aes(y = fit_cpm, x=Timepoint_day), method = "loess", se = FALSE, color = "red") +
xlab("Timepoint_day") + ylab("logCPM") +
theme_Publication() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_wrap(~ ID, scales = "free_y", labeller = labeller(ID = label_value)) +
theme(strip.text = element_text(face = "bold.italic"))
plt3Prepare a heatmap of the the genes that change during the timecourse
Focus on genes that go up upon stimulation with CD28 and CD3 beads.
up <- results %>%
filter(FDR < 0.05) %>%
filter(logFC.splines2 > 0)
up_genes <- head(up$Symbol,4)
logCPM.obs <- as.data.frame(cpm(dge, log=TRUE, prior.count=fit$prior.count))
logCPM.obs$ID <- row.names(logCPM.obs)
logCPM.fit <- as.data.frame(cpm(fit, log=TRUE))
logCPM.fit$ID <- row.names(logCPM.fit)
# Mung the cpm columns into a single tibble
lcm.obs <- pivot_longer(logCPM.obs[up_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
dplyr::rename(obs_cpm = count)
lcm.fit <- pivot_longer(logCPM.fit[up_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
dplyr::rename(fit_cpm = count)
lcm.obs$fit_cpm <- lcm.fit$fit_cpm
lcm <- left_join(lcm.obs, dge$samples[,c("Well_BC", "Timepoint_day", "Subset", "Receptor", "Donor")],
by=c("sample" = "Well_BC"))Visualise the results
plt4 <- ggplot(lcm) +
geom_point(aes(y = obs_cpm, x=Timepoint_day, colour=Subset)) +
scale_colour_Publication() +
geom_smooth(aes(y = fit_cpm, x=Timepoint_day), method = "loess", se = FALSE, color = "red") +
xlab("Timepoint_day") + ylab("logCPM") +
theme_Publication() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_wrap(~ ID, scales = "free_y", labeller = labeller(ID = label_value)) +
theme(strip.text = element_text(face = "bold.italic"))
plt4Prepare a heatmap of the the genes that go up.
This is a little less clear because the cells are thawed from liquid nitrogen so start off in a rough state.
down <- results %>%
filter(FDR < 0.05) %>%
filter(logFC.splines2 < 0)
down_genes <- head(down$Symbol,4)
logCPM.obs <- as.data.frame(cpm(dge, log=TRUE, prior.count=fit$prior.count))
logCPM.obs$ID <- row.names(logCPM.obs)
logCPM.fit <- as.data.frame(cpm(fit, log=TRUE))
logCPM.fit$ID <- row.names(logCPM.fit)
# Mung the cpm columns into a single tibble
lcm.obs <- pivot_longer(logCPM.obs[down_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
dplyr::rename(obs_cpm = count)
lcm.fit <- pivot_longer(logCPM.fit[down_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
dplyr::rename(fit_cpm = count)
lcm.obs$fit_cpm <- lcm.fit$fit_cpm
lcm <- left_join(lcm.obs, dge$samples[,c("Well_BC", "Timepoint_day", "Subset", "Receptor", "Donor")],
by=c("sample" = "Well_BC"))plt5 <- ggplot(lcm) +
geom_point(aes(y = obs_cpm, x=Timepoint_day, colour=Subset)) +
scale_colour_Publication() +
geom_smooth(aes(y = fit_cpm, x=Timepoint_day), method = "loess", se = FALSE, color = "red") +
xlab("Timepoint_day") + ylab("logCPM") +
theme_Publication() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_wrap(~ ID, scales = "free_y", labeller = labeller(ID = label_value)) +
theme(strip.text = element_text(face = "bold.italic"))
plt5Prepare a heatmap of the the genes that change during the
timecourse.
This is really not clear in the heatmap visulaisation.
This is where the majority of the changes are taking place based on the spline analysis
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 9.4 (Plow)
##
## Matrix products: default
## BLAS: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRblas.so
## LAPACK: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRlapack.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid splines stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] scales_1.3.0 ggthemes_5.1.0
## [3] here_1.0.1 knitr_1.48
## [5] viridis_0.6.5 viridisLite_0.4.2
## [7] patchwork_1.3.0 pheatmap_1.0.12
## [9] edgeR_4.2.2 limma_3.60.6
## [11] scater_1.32.1 scuttle_1.14.0
## [13] lubridate_1.9.3 forcats_1.0.0
## [15] stringr_1.5.1 dplyr_1.1.4
## [17] purrr_1.0.2 readr_2.1.5
## [19] tidyr_1.3.1 tibble_3.2.1
## [21] ggplot2_3.5.1 tidyverse_2.0.0
## [23] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [25] Biobase_2.64.0 GenomicRanges_1.56.2
## [27] GenomeInfoDb_1.40.1 IRanges_2.38.1
## [29] S4Vectors_0.42.1 BiocGenerics_0.50.0
## [31] MatrixGenerics_1.16.0 matrixStats_1.4.1
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.4
## [3] magrittr_2.0.3 compiler_4.4.1
## [5] mgcv_1.9-1 DelayedMatrixStats_1.26.0
## [7] vctrs_0.6.5 pkgconfig_2.0.3
## [9] crayon_1.5.3 fastmap_1.2.0
## [11] XVector_0.44.0 labeling_0.4.3
## [13] utf8_1.2.4 rmarkdown_2.28
## [15] tzdb_0.4.0 UCSC.utils_1.0.0
## [17] ggbeeswarm_0.7.2 bluster_1.14.0
## [19] xfun_0.48 zlibbioc_1.50.0
## [21] cachem_1.1.0 beachmat_2.20.0
## [23] jsonlite_1.8.9 highr_0.11
## [25] DelayedArray_0.30.1 BiocParallel_1.38.0
## [27] cluster_2.1.6 irlba_2.3.5.1
## [29] parallel_4.4.1 R6_2.5.1
## [31] bslib_0.8.0 stringi_1.8.4
## [33] RColorBrewer_1.1-3 jquerylib_0.1.4
## [35] Rcpp_1.0.13 igraph_2.0.3
## [37] Matrix_1.7-0 timechange_0.3.0
## [39] tidyselect_1.2.1 rstudioapi_0.17.0
## [41] abind_1.4-8 yaml_2.3.10
## [43] codetools_0.2-20 lattice_0.22-6
## [45] withr_3.0.1 evaluate_1.0.1
## [47] pillar_1.9.0 generics_0.1.3
## [49] rprojroot_2.0.4 hms_1.1.3
## [51] sparseMatrixStats_1.16.0 munsell_0.5.1
## [53] glue_1.8.0 metapod_1.12.0
## [55] tools_4.4.1 BiocNeighbors_1.22.0
## [57] ScaledMatrix_1.12.0 locfit_1.5-9.10
## [59] scran_1.32.0 colorspace_2.1-1
## [61] nlme_3.1-164 GenomeInfoDbData_1.2.12
## [63] beeswarm_0.4.0 BiocSingular_1.20.0
## [65] vipor_0.4.7 cli_3.6.3
## [67] rsvd_1.0.5 fansi_1.0.6
## [69] S4Arrays_1.4.1 gtable_0.3.5
## [71] sass_0.4.9 digest_0.6.37
## [73] dqrng_0.4.1 SparseArray_1.4.8
## [75] ggrepel_0.9.6 farver_2.1.2
## [77] htmltools_0.5.8.1 lifecycle_1.0.4
## [79] httr_1.4.7 statmod_1.5.0